A Chinese Corpus for Linguistic Research

Authors

  • Chu-Ren Huang
  • Keh-Jiann Chen
Abstract

The project being reported on is a sub-project of the ongoing research of the CKIP (Chinese Knowledge Information Processing) Group. This group was founded by Hsieh Ching-chun in 1986 and is currently directed by Keh-Jiann Chen and Chu-Ren Huang (Chang et al. 1989, Hsieh et al. 1989, Chen et al. 1991). The CKIP research is divided into three sub-projects according to their goals: 1) An On-line Lexicon for NLP, 2) A Corpus, and 3) A Parser. The sub-projects are designed to create a self-sufficient and mutually supporting environment for Chinese NLP: the corpus will be the database supporting the electronic lexicon, while the lexicon will be the basic reference for automatically tagging the corpus. Moreover, both the corpus and the lexicon will support the parser. Our parser adopts the unification-based formalism of ICG (Information-based Case Grammar, Chen and Huang 1990), which encodes all grammatical information on each lexical entry. At this point in time, the lexicon consists of a fully automated earlier version with limited grammatical information and an updated version with complete grammatical information for parsing. There are more than 40 thousand entries in the completed electronic dictionary, which is available on-line in Taiwan and allows basic pattern-matching searches. There is also a PC version with reduced search capacity available from the Industrial Technology Research Institute, the primary funding agency of this pilot dictionary project. The updated version now contains roughly 30 thousand entries with complete grammatical information and another 60 thousand with basic grammatical categories. Manipulation of lexical information, such as addition of entries and specification of detailed grammatical information with respect to each attribute, is maintained on-line (Jian and Chen 1991). The completed 90-thousand-word lexicon will be our core lexicon for parsing. The hierarchical arrangement will enable us to efficiently add new entries and create special lexicons for sub-domains.


Similar resources

Using Chinese Gigaword Corpus and Chinese Word Sketch in Linguistic Research

In this paper, we explore the possibility of deeper linguistic research based on corpus and computational linguistic tools. In particular, we adopt Chinese Word Sketch, the application of Word Sketch Engine to Chinese GigaWord Corpus, for linguistic research. We apply Chinese Sketch Engine results to deeper linguistic accounts such as selectional restriction and event type selection. The study is...


The Standard of Chinese Corpus Metadata

The normalization of corpus metadata plays a key role in building sharable corpora. However, there is currently no uniform specification for defining and processing metadata in Chinese corpora. This paper introduces a metadata system we have proposed for Chinese corpora. In all, 46 elements are defined, which can be divided into 6 classes: information about copyright, information about background of ...


Translation and contrastive linguistic studies at the interface of English and Chinese: Significance and implications

Corpora have revolutionized nearly all areas of linguistic research over the past four decades (McEnery, Xiao and Tono 2006; McEnery and Hardie 2012). Translation studies and contrastive linguistics are no exceptions. Indeed, the rapid development of bilingual parallel corpora as well as monolingual and multilingual comparable corpora since the early 1990s has been of particular relevance and c...


Construction of a Chinese Opinion Treebank

In this paper, we construct the Chinese Opinion Treebank for research on opinion analysis, building on the syntactically annotated Chinese Treebank corpus. We introduce the tagging scheme and develop a tagging tool for constructing this corpus. Annotated samples are described. Information including opinions (yes or no), their polarities (positive, neutral or negative), types (expression, status, or...


Salient Linguistic Features of Chinese Learners with Different L1s: A Corpus-based Study

The study aims to explore the salient linguistic features of Chinese lexical items used by learners with different L1s. The research method is corpus-based and includes comparing the learner corpus with the native-speaker corpus, as well as comparing sub-corpora for different L1s. The learner corpus, which consists of more than 1.14 million Chinese words from texts by novice to advanced learners, is mainly ...


The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study

This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totall...



Publication date: 1992